Introduction to the p-value

  • The p-value is a key concept in hypothesis testing, used to quantify the evidence against a null hypothesis.
  • In this presentation, we will:
    • Define the p-value and explain its significance.
    • Explore a real-world example using a dataset of real estate prices.
    • Perform hypothesis testing to determine if waterfront properties have higher prices than non-waterfront properties.
  • Tools and visualizations will include:
    • Statistical testing in R.
    • Interactive and static plots with plotly and ggplot2.

What is the p-value?

  • The p-value is the probability of observing results as extreme as (or more extreme than) the ones observed, assuming the null hypothesis is true.
  • It helps answer the question: “Are the observed data due to chance, or is there an actual effect?”

Key Points:

  1. Small p-value (e.g., ≤ 0.05):
    • Strong evidence against the null hypothesis.
    • We may reject the null hypothesis.
  2. Large p-value (> 0.05):
    • Weak evidence against the null hypothesis.
    • Fail to reject the null hypothesis.

Formula for Test Statistic:

\[ T = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} \] Where: - \(\bar{X}_1, \bar{X}_2\): Sample means. - \(s_1^2, s_2^2\): Sample variances. - \(n_1, n_2\): Sample sizes.

Real-World Example: Real Estate Data

Source : https://www.kaggle.com/code/fareedalianwar/house-price/input?select=data.csv

  • Dataset Overview

The dataset contains property sales records, including features such as: Price: Sale price of the property. Bedrooms, Bathrooms: Number of bedrooms and bathrooms. sqft_living: Square footage of the living area. Waterfront: Whether the property has a waterfront view (1 = Yes, 0 = No). And more…

Objective

  • Hypothesis: Do properties with a waterfront view have significantly higher average prices than properties without a waterfront view?

Hypotheses

  1. Null Hypothesis (\(H_0\)): The average prices of waterfront and non-waterfront properties are the same.
  2. Alternative Hypothesis (\(H_1\)): Waterfront properties have a higher average price than non-waterfront properties.

Data Exploration

Key Steps in Exploring the Data

  1. Summary Statistics:
    • Compare the average property prices for waterfront and non-waterfront properties.
    • Look at the spread of data (e.g., standard deviation, range).
  2. Visualizing the Data:
    • Plot price distributions for both groups to observe patterns.

Summary Sraristics Formula

  • Mean: \(\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i\)
  • Standard Deviation: \(\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2}\)
  • Median: The middle value in the dataset when ordered.

Summary Statistics Example (R Code):

# Load necessary libraries
library(dplyr)
data <- read.csv("data.csv")
# Calculate summary statistics
summary_stats <- data %>%
  group_by(waterfront) %>%
  summarize(
    mean_price = mean(price, na.rm = TRUE),
    median_price = median(price, na.rm = TRUE),
    sd_price = sd(price, na.rm = TRUE),
    count = n()
  )

Results

print(summary_stats)
## # A tibble: 2 × 5
##   waterfront mean_price median_price sd_price count
##        <int>      <dbl>        <dbl>    <dbl> <int>
## 1          0    545462.       460000  547781.  4567
## 2          1   1451621.       988500 1425994.    33

Explanation:

Waterfront Properties: Have a higher average price ($1,451,621) compared to non-waterfront properties ($545,462), as well as a larger standard deviation. Sample Size: There are 33 waterfront properties compared to 4,567 non-waterfront properties, indicating a smaller sample for waterfront properties. This shows clear differences in the statistics, which we will use to test our hypothesis

Hypothesis Testing: Two-Sample t-Test

Objective:

We will use a two-sample t-test to test the hypothesis that waterfront properties have higher average prices than non-waterfront properties.

Null Hypothesis (\(H_0\)):

The mean price of waterfront properties is equal to the mean price of non-waterfront properties.
\(H_0: \mu_{\text{waterfront}} = \mu_{\text{non-waterfront}}\)

Alternative Hypothesis (\(H_1\)):

The mean price of waterfront properties is higher than the mean price of non-waterfront properties.
\(H_1: \mu_{\text{waterfront}} > \mu_{\text{non-waterfront}}\)

We will perform a one-tailed t-test to test this hypothesis.

Code to Perform the Two-Sample t-Test

# Perform the t-test
t_test_result <- t.test(price ~ waterfront,
      data = data, alternative = "greater")

# Display the results
t_test_result
## 
##  Welch Two Sample t-test
## 
## data:  price by waterfront
## t = -3.6485, df = 32.068, p-value = 0.9995
## alternative hypothesis: true difference in means between group 0 and group 1 is greater than 0
## 95 percent confidence interval:
##  -1326837      Inf
## sample estimates:
## mean in group 0 mean in group 1 
##        545462.3       1451621.2

t-Test Explanation

  • t-statistic: -3.6485

    The negative t-statistic indicates that, on average, waterfront properties have lower prices than non-waterfront properties. This is the opposite of our hypothesis that waterfront properties would have higher prices.

  • p-value: 0.9995

    The p-value is extremely high (greater than 0.05), meaning there is very little evidence to reject the null hypothesis. We cannot conclude that waterfront properties have higher prices based on this data.

  • 95% Confidence Interval: [-1,326,837, Inf]

    The confidence interval for the difference in means includes negative values, further suggesting that the difference in prices could be in favor of non-waterfront properties.

Conclusion

  • Based on the results of the t-test:

    • p-value = 0.9995, which is much larger than the significance level of 0.05.
    • We fail to reject the null hypothesis that the mean price of waterfront and non-waterfront properties is the same.
  • Interpretation: There is no significant evidence that waterfront properties have higher average prices than non-waterfront properties in this dataset.

  • The analysis suggests that the prices of waterfront properties might not differ as expected from non-waterfront properties based on the data provided.

plotly plot

library(plotly)
plot_ly(data = data, x = ~sqft_living, y = ~price, z = ~waterfront, type = 'scatter3d', mode = 'markers')

ggplot1

library(ggplot2)

# Price distribution plot
ggplot(data, aes(x = factor(waterfront), y = price)) +
  geom_boxplot() +
  labs(title = "Price Distribution: Waterfront vs Non-Waterfront")

ggplot2

# Scatter plot of sqft vs price
ggplot(data, aes(x = sqft_living, y = price)) +
  geom_point(aes(color = factor(waterfront))) +
  labs(title = "Price vs Square Footage")

Final Thoughts

  • Key Takeaways:

    The p-value is a critical tool in hypothesis testing, helping us assess the strength of evidence against the null hypothesis.

    In our real estate example, the p-value of 0.9995 led us to fail to reject the null hypothesis, meaning there’s no significant evidence that waterfront properties have higher prices than non-waterfront ones. This shows the importance of understanding and interpreting statistical results properly, especially in real-world applications like real estate.

  • Implications:

While the initial hypothesis was not supported, the analysis emphasizes the need to thoroughly evaluate data and assumptions before drawing conclusions. The example demonstrates how statistical tools can help make data-driven decisions in industries like real estate.

Thank you for your attention!